Optical Character Recognition


Meta-Album: Multi-domain Meta-Dataset for Few-Shot Image Classification

Neural Information Processing Systems

We introduce Meta-Album, an image classification meta-dataset designed to facilitate few-shot learning, transfer learning, and meta-learning, among other tasks. It includes 40 open datasets, each having at least 20 classes with 40 examples per class, with verified licences. They stem from diverse domains, such as ecology (fauna and flora), manufacturing (textures, vehicles), human actions, and optical character recognition, featuring various image scales (microscopic, human scale, remote sensing). All datasets are preprocessed, annotated, and formatted uniformly, and come in 3 versions (Micro, Mini, Extended) to match users' computational resources.
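
Meta-Album's uniform formatting (every dataset offers at least 20 classes with 40 examples per class) makes it easy to carve out N-way k-shot episodes. Below is a minimal, hedged sketch of an episode sampler; the function name and the flat label-list input format are illustrative assumptions, not Meta-Album's actual API.

```python
import random
from collections import defaultdict

def sample_episode(labels, n_way=5, k_shot=5, n_query=15, rng=None):
    """Sample an N-way k-shot episode from a flat list of per-example labels.

    Returns (support, query), each a list of (example_index, episode_class).
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Pick N classes, then k support + n_query query examples per class.
    classes = rng.sample(sorted(by_class), n_way)
    support, query = [], []
    for episode_class, cls in enumerate(classes):
        chosen = rng.sample(by_class[cls], k_shot + n_query)
        support += [(i, episode_class) for i in chosen[:k_shot]]
        query += [(i, episode_class) for i in chosen[k_shot:]]
    return support, query

# Toy example matching Meta-Album's minimum spec: 20 classes x 40 examples.
labels = [c for c in range(20) for _ in range(40)]
support, query = sample_episode(labels, rng=random.Random(0))
print(len(support), len(query))  # 25 75
```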


Windows Photos adds fancy editing features from other Microsoft apps

PCWorld

Microsoft is adding ways to make the Windows Photos app much more powerful, combining elements of the elegant Designer app and making Photos more of a centerpiece for visual editing. Microsoft is taking optical character recognition capabilities that it developed several years ago and adding them to Photos, while pulling in design elements from Microsoft Designer, too. Finally, the company is beefing up File Explorer a bit as well, giving it a more robust visual search capability. Unfortunately, it's also adding a Copilot button, which for now doesn't really do much. Microsoft's Windows Photos app languished for years, but it started enjoying a renaissance about two years ago with new AI-powered editing features.


Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

Neural Information Processing Systems

Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech (TTS) systems. However, previous approaches require substantial annotated training data and additional effort from language experts, making it difficult to extend high-quality neural TTS systems to out-of-domain daily conversations and the countless languages worldwide. This paper tackles the polyphone disambiguation problem from a concise and novel perspective: we propose Dict-TTS, a semantic-aware generative text-to-speech model with an online dictionary (existing prior information in natural language). Specifically, we design a semantics-to-pronunciation attention (S2PA) module to match the semantic patterns between the input text sequence and the prior semantics in the dictionary and obtain the corresponding pronunciations; the S2PA module can be easily trained with the end-to-end TTS model without any annotated phoneme labels. Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy and improves the prosody modeling of TTS systems. Further extensive analyses demonstrate that each design in Dict-TTS is effective.
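
The core S2PA idea, attending from text hidden states over dictionary entries by semantic similarity and retrieving the matched pronunciations, can be illustrated with plain dot-product attention. The sketch below is a toy reading of the abstract, not the paper's implementation; all tensor names and shapes are made up for illustration.

```python
import torch
import torch.nn.functional as F

def s2pa_sketch(text_hidden, dict_semantics, dict_pronunciations):
    """Toy semantics-to-pronunciation attention.

    text_hidden:         (T, d) hidden states of the input text sequence
    dict_semantics:      (E, d) semantic embeddings of dictionary entries
    dict_pronunciations: (E, p) pronunciation embeddings of the same entries

    Each text position attends over dictionary entries by semantic
    similarity and retrieves a weighted mix of their pronunciations.
    """
    d = text_hidden.size(-1)
    scores = text_hidden @ dict_semantics.T / d ** 0.5  # (T, E)
    weights = F.softmax(scores, dim=-1)                 # (T, E)
    return weights @ dict_pronunciations                # (T, p)

# Toy shapes: 6 text positions, 10 dictionary entries.
out = s2pa_sketch(torch.randn(6, 32), torch.randn(10, 32), torch.randn(10, 16))
print(out.shape)  # torch.Size([6, 16])
```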


StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Neural Information Processing Systems

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multi-speaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single-speaker and multi-speaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.
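
The SLM-discriminator idea can be sketched as: freeze a pre-trained speech language model, run real and generated audio through it, and train a small head on its features to produce real/fake logits. The sketch below uses a stand-in module in place of WavLM and is only one reading of the abstract; StyleTTS 2's actual discriminator design differs in its details.

```python
import torch
import torch.nn as nn

class SLMDiscriminatorSketch(nn.Module):
    """Toy discriminator head on top of frozen SLM features.

    `slm` is any module mapping waveforms (B, samples) to features
    (B, T, d), e.g. a pre-trained WavLM encoder, treated as a black box.
    """
    def __init__(self, slm, feature_dim):
        super().__init__()
        self.slm = slm.eval()
        for p in self.slm.parameters():  # freeze the SLM itself
            p.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(feature_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1)
        )

    def forward(self, wav):
        # SLM parameters stay frozen, but gradients still flow through its
        # activations back to the generator during adversarial training.
        feats = self.slm(wav)                # (B, T, d)
        return self.head(feats).mean(dim=1)  # one real/fake logit per clip

# Stand-in SLM: frame the waveform and project frames to 64-dim features.
class DummySLM(nn.Module):
    def forward(self, wav):
        return wav.unfold(1, 400, 320) @ torch.randn(400, 64)

disc = SLMDiscriminatorSketch(DummySLM(), feature_dim=64)
print(disc(torch.randn(2, 16000)).shape)  # torch.Size([2, 1])
```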


Appendices for the Paper: pFL-Bench: A Comprehensive Benchmark for Personalized Federated Learning

Neural Information Processing Systems

We provide more details and experimental results for pFL-Bench in the appendices. Sec. A covers the details of the adopted datasets and models (e.g., tasks, heterogeneous partitions, and model architectures), and the extensions of pFL-Bench to other datasets and models. Besides, to demonstrate the potential and ease of extensibility of pFL-Bench, we also conducted experiments in the heterogeneous device resource scenario based on FedScale [38] (Sec. D.4), as well as experiments incorporating privacy-preserving techniques (Sec. D.5). We present detailed descriptions of the 12 publicly available dataset variants used in pFL-Bench. These datasets are popular in their respective fields and cover a wide range of domains, scales, partition manners, and Non-IID degrees. The Federated Extended MNIST (FEMNIST) is a widely used FL dataset for 62-class handwritten character recognition [32]. The original FEMNIST dataset contains 3,550 clients, each corresponding to a character writer from EMNIST [91]. Following [13], we adopt the sub-sampled version in pFL-Bench, which contains 200 clients and a total of 43,400 images at a resolution of 28x28 pixels; the dataset is randomly split into train/valid/test sets with a ratio of 3:1:1.
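
The 3:1:1 random split described above is straightforward to reproduce. A minimal sketch, assuming a flat collection of example indices (whether pFL-Bench splits globally or per-client is not specified in this excerpt, so the function below is illustrative only):

```python
import random

def split_3_1_1(indices, seed=0):
    """Randomly split example indices into train/valid/test with ratio 3:1:1."""
    indices = list(indices)
    random.Random(seed).shuffle(indices)
    n = len(indices)
    n_train, n_valid = 3 * n // 5, n // 5
    return (indices[:n_train],
            indices[n_train:n_train + n_valid],
            indices[n_train + n_valid:])

# FEMNIST in pFL-Bench: 43,400 images in total across 200 clients.
train, valid, test = split_3_1_1(range(43400))
print(len(train), len(valid), len(test))  # 26040 8680 8680
```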


Supplementary Material of Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Neural Information Processing Systems

Details of the Model Architecture. The detailed encoder architecture is depicted in Figure 7. The decoder architecture, along with some implementation details we use in the decoder, is depicted in Figure 8. We design the grouped 1x1 convolutions to be able to mix channels: for each group, the same number of channels is extracted from each of the two halves of the feature map separated by the coupling layers. Figure 8c shows an example.
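
One possible reading of that grouped 1x1 convolution, sketched below: reorder the channels so each group draws the same number of channels from each coupling half, then apply an invertible (orthogonally initialized) 1x1 kernel per group. This is an illustration of the description, not Glow-TTS's released code; the permutation scheme and initialization are assumptions.

```python
import torch
import torch.nn as nn

class GroupedInvConv1x1Sketch(nn.Module):
    """Toy grouped invertible 1x1 convolution over (B, C, T) features.

    Channels are reordered so each group takes half its channels from the
    first coupling half and half from the second, letting the per-group
    1x1 convolutions mix information across the two halves.
    """
    def __init__(self, channels, n_groups):
        super().__init__()
        assert channels % (2 * n_groups) == 0
        self.n_groups = n_groups
        half = channels // 2
        per_half = half // n_groups
        # Interleave: group g gets per_half channels from each coupling half.
        perm = []
        for g in range(n_groups):
            perm += list(range(g * per_half, (g + 1) * per_half))
            perm += list(range(half + g * per_half, half + (g + 1) * per_half))
        self.register_buffer("perm", torch.tensor(perm))
        self.register_buffer("inv_perm", torch.argsort(torch.tensor(perm)))
        # One orthogonal (hence invertible) 1x1 kernel, shared across groups
        # at initialization for simplicity.
        w = torch.linalg.qr(torch.randn(2 * per_half, 2 * per_half))[0]
        self.weight = nn.Parameter(w.repeat(n_groups, 1, 1))

    def forward(self, x):
        x = x[:, self.perm]  # mix coupling halves into groups
        b, c, t = x.shape
        x = x.view(b, self.n_groups, c // self.n_groups, t)
        x = torch.einsum("gij,bgjt->bgit", self.weight, x)
        return x.reshape(b, c, t)[:, self.inv_perm]

m = GroupedInvConv1x1Sketch(channels=8, n_groups=2)
print(m(torch.randn(3, 8, 5)).shape)  # torch.Size([3, 8, 5])
```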


SHDocs: A dataset, benchmark, and method to efficiently generate high-quality, real-world specular highlight data with near-perfect alignment

Neural Information Processing Systems

A frequent problem in vision-based reasoning tasks such as object detection and optical character recognition (OCR) is the persistence of specular highlights. Specular highlights appear as bright spots of glare caused by the concentrated reflection of light; these spots manifest as image artifacts that occlude the underlying content, confounding computer vision models, and are challenging to reconstruct. Despite this, specular highlight removal receives relatively little attention due to the difficulty of acquiring high-quality, real-world data. We introduce a method to generate specular highlight data with near-perfect alignment and present SHDocs, a dataset of specular highlights on document images created using our method. Through our benchmark, we demonstrate that our dataset enables us to surpass the performance of state-of-the-art specular highlight removal models and improve downstream OCR performance.


One of the most frustrating problems at work: solved

Popular Science

It's 2025, and converting files from one format to another should only take a few clicks. But it often becomes a lengthy process requiring uploads to unsecured online converter apps that can put your personal information at risk. Usually, this PDF conversion license is $99.99, but right now it's down to $23.99 when you use code SAVE20 at checkout. PDF Converter Pro works with Microsoft Word, Excel, PowerPoint, Text, HTML, PNG, and JPG files. It even maintains your original layouts, images, and hyperlinks after conversion, without losing quality.


Revisiting Noise in Natural Language Processing for Computational Social Science

arXiv.org Artificial Intelligence

Computational Social Science (CSS) is an emerging field driven by the unprecedented availability of human-generated content for researchers. This field, however, presents a unique set of challenges due to the nature of the theories and datasets it explores, including highly subjective tasks and complex, unstructured textual corpora. Among these challenges, one of the less well-studied topics is the pervasive presence of noise. This thesis aims to address this gap in the literature by presenting a series of interconnected case studies that examine different manifestations of noise in CSS. These include character-level errors following the OCR processing of historical records, archaic language, inconsistencies in annotations for subjective and ambiguous tasks, and even noise and biases introduced by large language models during content generation. This thesis challenges the conventional notion that noise in CSS is inherently harmful or useless. Rather, it argues that certain forms of noise can encode meaningful information that is invaluable for advancing CSS research, such as the unique communication styles of individuals or the culture-dependent nature of datasets and tasks. Further, this thesis highlights the importance of nuance in dealing with noise and the considerations CSS researchers must address when encountering it, demonstrating that different types of noise require distinct strategies.